{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "markdown"
    }
   },
   "source": [
    "# Basic RAG from Scratch\n",
    "\n",
    "<a href=\"https://aiengineering.academy/\" target=\"_blank\">\n",
    "<img src=\"https://raw.githubusercontent.com/adithya-s-k/AI-Engineering.academy/main/assets/banner.png\" alt=\"Ai Engineering. Academy\" width=\"50%\">\n",
    "</a>\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adithya-s-k/AI-Engineering.academy/blob/main/RAG/01_Basic_RAG/basic_rag_scratch.ipynb)\n",
    "[![GitHub Stars](https://img.shields.io/github/stars/adithya-s-k/AI-Engineering.academy?style=social)](https://github.com/adithya-s-k/AI-Engineering.academy/stargazers)\n",
    "[![GitHub Forks](https://img.shields.io/github/forks/adithya-s-k/AI-Engineering.academy?style=social)](https://github.com/adithya-s-k/AI-Engineering.academy/network/members)\n",
    "[![GitHub Issues](https://img.shields.io/github/issues/adithya-s-k/AI-Engineering.academy)](https://github.com/adithya-s-k/AI-Engineering.academy/issues)\n",
    "[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/adithya-s-k/AI-Engineering.academy)](https://github.com/adithya-s-k/AI-Engineering.academy/pulls)\n",
    "[![License](https://img.shields.io/github/license/adithya-s-k/AI-Engineering.academy)](https://github.com/adithya-s-k/AI-Engineering.academy/blob/main/LICENSE)\n",
    "\n",
    "\n",
    "This notebook implements a basic Retrieval-Augmented Generation (RAG) system from scratch, without relying on external libraries except for essential system-level functionalities. This approach focuses on demonstrating the core concepts of RAG using fundamental Python operations.\n",
    "\n",
    "\n",
    "**Core Steps:**\n",
    "\n",
    "1. **Data Loading:** Read text data from a file.\n",
    "2. **Chunking:** Split the text into manageable chunks.\n",
    "3. **Embedding Simulation:** Create simple numerical representations (simulated embeddings).\n",
    "4. **Semantic Search (Similarity):** Implement a basic similarity calculation.\n",
    "5. **Response Generation (Placeholder):** Use a simple string concatenation as a placeholder for LLM response.\n",
    "6. **Evaluation (Basic String Matching):** Evaluate the generated response against a known answer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting Up the Environment\n",
    "We begin by importing necessary libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import fitz\n",
    "import numpy as np\n",
    "import json\n",
    "import os\n",
    "from litellm import completion, embedding\n",
    "\n",
    "# plain openai also can be used\n",
    "# from openai import OpenAI\n",
    "\n",
    "# initilize openai client\n",
    "# client = OpenAI(,\n",
    "#     api_key=os.getenv(\"OPENAI_API_KEY\")  # Retrieve the API key from environment variables\n",
    "# )\n",
    "\n",
    "# we are using litellm as it allows us to easily switch between different LLM providers\n",
    "# and is compatible with the same API\n",
    "\n",
    "# Configure API keys (replace with your actual keys)\n",
    "os.environ['OPENAI_API_KEY'] = \"\"  # Replace with your OpenAI API key\n",
    "os.environ['ANTHROPIC_API_KEY'] = \"\" # Replace with your Anthropic API key\n",
    "os.environ['GROQ_API_KEY'] = \"\" # Replace with your Groq API key\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extracting Text from a PDF \n",
    "We extract text from a PDF file using PyMuPDF. This process involves opening the PDF, reading its contents, and converting them into a format suitable for further processing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_text_from_pdf(pdf_path):\n",
    "    \"\"\"\n",
    "    Extracts and consolidates text from all pages of a PDF file. This is the first step in the RAG pipeline,\n",
    "    where we acquire the raw textual data that will later be processed, embedded, and retrieved against.\n",
    "\n",
    "    Args:\n",
    "    pdf_path (str): Path to the PDF file to be processed.\n",
    "\n",
    "    Returns:\n",
    "    str: Complete extracted text from all pages of the PDF, concatenated into a single string.\n",
    "         This raw text will be further processed in subsequent steps of the RAG pipeline.\n",
    "    \"\"\"\n",
    "    # Open the PDF file\n",
    "    mypdf = fitz.open(pdf_path)\n",
    "    all_text = \"\"  # Initialize an empty string to store the extracted text\n",
    "\n",
    "    # Iterate through each page in the PDF\n",
    "    for page_num in range(mypdf.page_count):\n",
    "        page = mypdf[page_num]  # Get the page\n",
    "        text = page.get_text(\"text\")  # Extract text from the page\n",
    "        all_text += text  # Append the extracted text to the all_text string\n",
    "\n",
    "    return all_text  # Return the extracted text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chunking the Extracted Text\n",
    "Once we have the extracted text, we divide it into smaller, overlapping chunks to improve retrieval accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def chunk_text(text, n, overlap):\n",
    "    \"\"\"\n",
    "    Divides text into smaller, overlapping chunks for more effective processing and retrieval.\n",
    "    Chunking is a critical step in RAG systems as it:\n",
    "    1. Makes large documents manageable for embedding models that have token limits\n",
    "    2. Enables more precise retrieval of relevant information\n",
    "    3. Allows for contextual understanding within reasonable boundaries\n",
    "    \n",
    "    The overlap between chunks helps maintain context continuity and reduces the risk of\n",
    "    splitting important information across chunk boundaries.\n",
    "\n",
    "    Args:\n",
    "    text (str): The complete text to be chunked.\n",
    "    n (int): The maximum number of characters in each chunk.\n",
    "    overlap (int): The number of overlapping characters between consecutive chunks.\n",
    "                   Higher overlap improves context preservation but increases redundancy.\n",
    "\n",
    "    Returns:\n",
    "    List[str]: A list of text chunks that will be individually embedded and used for retrieval.\n",
    "    \"\"\"\n",
    "    chunks = []  # Initialize an empty list to store the chunks\n",
    "    \n",
    "    # Loop through the text with a step size of (n - overlap)\n",
    "    for i in range(0, len(text), n - overlap):\n",
    "        # Append a chunk of text from index i to i + n to the chunks list\n",
    "        chunks.append(text[i:i + n])\n",
    "\n",
    "    return chunks  # Return the list of text chunks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extracting and Chunking Text from a PDF File\n",
    "Now, we load the PDF, extract text, and split it into chunks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the path to the PDF file\n",
    "pdf_path = \"data/AI_Information.pdf\"\n",
    "\n",
    "# Extract text from the PDF file\n",
    "extracted_text = extract_text_from_pdf(pdf_path)\n",
    "\n",
    "# Chunk the extracted text into segments of 1000 characters with an overlap of 200 characters\n",
    "text_chunks = chunk_text(extracted_text, 1000, 200)\n",
    "\n",
    "# Print the number of text chunks created\n",
    "print(\"Number of text chunks:\", len(text_chunks))\n",
    "\n",
    "# Print the first text chunk\n",
    "print(\"\\nFirst text chunk:\")\n",
    "print(text_chunks[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating Embeddings for Text Chunks\n",
    "Embeddings transform text into numerical vectors, which allow for efficient similarity search."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_embeddings(text, model=\"text-embedding-ada-002\"):\n",
    "    \"\"\"\n",
    "    Transforms text into dense vector representations (embeddings) using a neural network model.\n",
    "    Embeddings are the cornerstone of modern RAG systems because they:\n",
    "    1. Capture semantic meaning in a numerical format that computers can process\n",
    "    2. Enable similarity-based retrieval beyond simple keyword matching\n",
    "    3. Allow for efficient indexing and searching of large document collections\n",
    "    \n",
    "    In RAG, both document chunks and user queries are embedded in the same vector space,\n",
    "    allowing us to find the most semantically relevant chunks for a given query.\n",
    "\n",
    "    Args:\n",
    "    text (str or List[str]): The input text(s) to be embedded. Can be a single string or a list of strings.\n",
    "    model (str): The embedding model to use. Default is OpenAI's \"text-embedding-ada-002\".\n",
    "                 Different models offer various tradeoffs between quality, speed, and cost.\n",
    "\n",
    "    Returns:\n",
    "    dict: The response from the API containing the embeddings, which are high-dimensional\n",
    "          vectors representing the semantic content of the input text(s).\n",
    "    \"\"\"\n",
    "    # Create embeddings for the input text using the specified model\n",
    "    response = embedding(model=model, input=text)\n",
    "\n",
    "    return response  # Return the response containing the embeddings\n",
    "\n",
    "# Create embeddings for the text chunks\n",
    "response = create_embeddings(text_chunks)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Performing Semantic Search\n",
    "We implement cosine similarity to find the most relevant text chunks for a user query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def cosine_similarity(vec1, vec2):\n",
    "    \"\"\"\n",
    "    Calculates the cosine similarity between two vectors, which measures the cosine of the angle between them.\n",
    "    \n",
    "    Cosine similarity is particularly well-suited for RAG systems because:\n",
    "    1. It measures semantic similarity independent of vector magnitude (document length)\n",
    "    2. It ranges from -1 (completely opposite) to 1 (exactly the same), making it easy to interpret\n",
    "    3. It works well in high-dimensional spaces like those used for text embeddings\n",
    "    4. It's computationally efficient compared to some other similarity metrics\n",
    "\n",
    "    Args:\n",
    "    vec1 (np.ndarray): The first embedding vector.\n",
    "    vec2 (np.ndarray): The second embedding vector.\n",
    "\n",
    "    Returns:\n",
    "    float: The cosine similarity score between the two vectors, ranging from -1 to 1.\n",
    "           Higher values indicate greater semantic similarity between the original texts.\n",
    "    \"\"\"\n",
    "    # Compute the dot product of the two vectors and divide by the product of their norms\n",
    "    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def semantic_search(query, text_chunks, embeddings, k=5):\n",
    "    \"\"\"\n",
    "    Performs semantic search to find the most relevant text chunks for a given query.\n",
    "    This is the core retrieval component of the RAG system, responsible for:\n",
    "    1. Finding the most semantically relevant information from the knowledge base\n",
    "    2. Filtering out irrelevant content to improve generation quality\n",
    "    3. Providing the context that will be used by the LLM for response generation\n",
    "    \n",
    "    The quality of retrieval directly impacts the quality of the final generated response,\n",
    "    as the LLM can only work with the context it's provided.\n",
    "\n",
    "    Args:\n",
    "    query (str): The user's question or query text.\n",
    "    text_chunks (List[str]): The corpus of text chunks to search through.\n",
    "    embeddings (List[dict]): Pre-computed embeddings for each text chunk.\n",
    "    k (int): The number of top relevant chunks to retrieve. This parameter balances:\n",
    "             - Too low: May miss relevant information\n",
    "             - Too high: May include irrelevant information and exceed context limits\n",
    "\n",
    "    Returns:\n",
    "    List[str]: The top k most semantically relevant text chunks for the query,\n",
    "               which will be used as context for the LLM to generate a response.\n",
    "    \"\"\"\n",
    "    # Create an embedding for the query\n",
    "    query_embedding = create_embeddings(query).data[0].embedding\n",
    "    similarity_scores = []  # Initialize a list to store similarity scores\n",
    "\n",
    "    # Calculate similarity scores between the query embedding and each text chunk embedding\n",
    "    for i, chunk_embedding in enumerate(embeddings):\n",
    "        similarity_score = cosine_similarity(np.array(query_embedding), np.array(chunk_embedding.embedding))\n",
    "        similarity_scores.append((i, similarity_score))  # Append the index and similarity score\n",
    "\n",
    "    # Sort the similarity scores in descending order\n",
    "    similarity_scores.sort(key=lambda x: x[1], reverse=True)\n",
    "    # Get the indices of the top k most similar text chunks\n",
    "    top_indices = [index for index, _ in similarity_scores[:k]]\n",
    "    # Return the top k most relevant text chunks\n",
    "    return [text_chunks[index] for index in top_indices]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running a Query on Extracted Chunks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the validation data from a JSON file\n",
    "with open('data/val.json') as f:\n",
    "    data = json.load(f)\n",
    "\n",
    "# Extract the first query from the validation data\n",
    "query = data[0]['question']\n",
    "\n",
    "# Perform semantic search to find the top 2 most relevant text chunks for the query\n",
    "top_chunks = semantic_search(query, text_chunks, response.data, k=2)\n",
    "\n",
    "# Print the query\n",
    "print(\"Query:\", query)\n",
    "\n",
    "# Print the top 2 most relevant text chunks\n",
    "for i, chunk in enumerate(top_chunks):\n",
    "    print(f\"Context {i + 1}:\\n{chunk}\\n=====================================\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating a Response Based on Retrieved Chunks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the system prompt for the AI assistant\n",
    "system_prompt = \"You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'\"\n",
    "\n",
    "def generate_response(system_prompt, user_message, model=\"meta-llama/Llama-3.2-3B-Instruct\"):\n",
    "    \"\"\"\n",
    "    Generates a contextually informed response using an LLM with the retrieved information.\n",
    "    This is the 'Generation' part of Retrieval-Augmented Generation, where:\n",
    "    1. The retrieved context is combined with the user query\n",
    "    2. The LLM synthesizes this information to produce a coherent, accurate response\n",
    "    3. The system prompt guides the model to stay faithful to the provided context\n",
    "    \n",
    "    By using retrieved information as context, the RAG system can:\n",
    "    - Provide up-to-date information beyond the LLM's training data\n",
    "    - Cite specific sources for its claims\n",
    "    - Reduce hallucination by grounding responses in retrieved facts\n",
    "    - Answer domain-specific questions with greater accuracy\n",
    "\n",
    "    Args:\n",
    "    system_prompt (str): Instructions that guide the AI's behavior and response style.\n",
    "                         In RAG, this typically instructs the model to use only the provided context.\n",
    "    user_message (str): The combined context and query to be sent to the LLM.\n",
    "                        This includes both the retrieved text chunks and the original user question.\n",
    "    model (str): The LLM to use for response generation. Default is \"meta-llama/Llama-3.2-3B-Instruct\".\n",
    "\n",
    "    Returns:\n",
    "    dict: The complete response from the LLM, containing the generated answer based on\n",
    "          the retrieved context and original query.\n",
    "    \"\"\"\n",
    "    response = completion(model=model, messages=[\n",
    "        {\"role\": \"system\", \"content\": system_prompt},\n",
    "        {\"role\": \"user\", \"content\": user_message}\n",
    "    ], temperature=0)\n",
    "    \n",
    "    # response = client.chat.completions.create(\n",
    "    #     model=model,\n",
    "    #     temperature=0,\n",
    "    #     messages=[\n",
    "    #         {\"role\": \"system\", \"content\": system_prompt},\n",
    "    #         {\"role\": \"user\", \"content\": user_message}\n",
    "    #     ]\n",
    "    # )\n",
    "    return response\n",
    "\n",
    "# Create the user prompt based on the top chunks\n",
    "user_prompt = \"\\n\".join([f\"Context {i + 1}:\\n{chunk}\\n=====================================\\n\" for i, chunk in enumerate(top_chunks)])\n",
    "user_prompt = f\"{user_prompt}\\nQuestion: {query}\"\n",
    "\n",
    "# Generate AI response\n",
    "ai_response = generate_response(system_prompt, user_prompt)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv-new-specific-rag",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
